Tractable approximations for probabilistic models: 
The adaptive TAP mean field approach 
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We develop an advanced mean field method for approxi- 
mating averages in probabilistic data models that is based on 
the TAP approach of disorder physics. In contrast to conven- 
tional TAP, where the knowledge of the distribution of cou- 
plings between the random variables is required, our method 
adapts to the concrete couplings. We demonstrate the valid- 
ity of our approach, which is sofar restricted to models with 
non-glassy behaviour, by replica calculations for a wide class 
of models as well as by simulations for a real data set. 



Probabilistic models (for a review see e.g. [Q) find 
widespread applications in many areas of data model- 
ing. Their goal is to explain complex observed data by 
a set of unobserved, hidden random variables based on 
the joint distribution of both sets of variables. The price 
that a modeler has to pay for the high degree of flexibil- 
ity of these models is the vast increase in computational 
complexity when the number of hidden variables is large. 

Both statistical inference about hidden variables and 
training usually require computation of marginal distri- 
butions of the hidden variables which for exact calcu- 
lation demands infeasible high dimensional sums or in- 
tegrals. Since similar types of calculations are ubiq- 
uitous in the computations of thermal averages, there 
is a great deal of interest in adopting approximation 
techniques from statistical physics. For a variety of 
cases, when a standard tool, the Monte Carlo sampling 
technique reaches its limits, a simple mean field (MF) 
method, which neglects correlations of random variables 
has yielded good results in a variety of probabilistic data 
models. The MF approximation yields a closed set of 
nonlinear equations for the approximate expectation val- 
ues of random variables which usually can be solved in 
a time that only grows polynomially in the number of 
variables. At present, there is a growing research activ- 
ity trying to overcome the limitations of the simple MF 
method by partly including the dependencies of variables 
but still keeping the approximation tractable (for a re- 
view see [^). 

Various researchers P-p^ have discussed applications 
of the so-called TAP MF approach, originating in the sta- 
tistical physics of disordered systems, first introduced by 
Thouless, Anderson and Palmer (TAP) to treat the 



Sherrington-Kirkpatrick (SK) model of disordered mag- 
netic materials . Under the assumption that the cou- 
plings (interactions) between random variables are them- 
selves drawn at random from certain classes of distribu- 
tions, the TAP equations become exact in the thermody- 
namic limit of infinitely many variables. Unfortunately, 
the Onsager correction to the simple, naive MF theory 
will explicitly depend on the distribution of these cou- 
plings. Two models with the same connectivities but 
different distributions for the couplings, like e.g. the SK 
model and the Hopfield model have different expres- 
sions for the Onsager corrections (see e.g. chapter 
XIII). 

In order to use the TAP method as a good approxi- 
mation for models of real data, the lack of knowledge of 
the underlying distribution of the couplings (which are 
usually functions of the observed data) should be com- 
pensated by an algorithm which adapts the Onsager cor- 
rection to the concrete set of couplings. Simply taking 
the correction from a theory that assumes a specific dis- 
tribution may lead to suboptimal performance. This let- 
ter presents a solution to this problem for an important 
class of probabilistic models. As a check of the validity 
of the approach, we show that our method leads to the 
exact results in the thermodynamic limit for large classes 
of probability distributions over the couplings. 

We will consider probabilistic models of the type 



P(S) = 



P(S) 
Zi9,J] 



■ exp 



SiJijSj -f ^ Si9i 



Kj 



(1) 



where the set S ~ {Si, . . . , Sn) denotes the (hidden) ran- 
dom variables of the model. Any observed (i.e. fixed) 
quantities are assumed to be encoded in the matrix J 
and the fields 0. The term p(S) = Y[j Pji^j) is a prod- 
uct distribution which also contains all constraints of the 
Si (the range, discreteness, etc). In its simplest version, 
when S is a real variable with positive measure p, the 
class of models (^ contains Ising models (such the SK 
and Hopfield models), Gaussian process models [||, prob- 
abilistic independent component analysis ]T6[ | and combi- 
natorial optimization problems Q . If we lift the restric- 
tions that all variables must be real random variables, 
we can treat a variety of important models with depen- 
dencies between the Si that are defined through a set of 



fields XijSi. We will give two examples. Bayesian 
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learning in single layer neural networks is described by 
a Gibbs distribution P(S) cx Po{S) Ojli ^^(E^i 
where S is a weight vector of the network being trained 
on a number of m data vectors with components Xij in 
a N dimensional space. Pq is a prior distribution of the 
weights and F is the Likelihood quantifying the good- 
ness of fit to the data Q. A second example is given 
by the class of Bayesian belief networks on a directed 
graph which are promising models for adaptive expert 
systems. They are defined by P(S) = ^'('5»|pa(S'j)) 
where Si G {0, 1} and pa denotes the parents of Si, i.e. 
the variables in the graph that feed their information 
into Si via directed bonds. A specific type is the sig- 
moid behef networks ll^, where P{Si\pa,{Si)) 



1+e'' 



with hi — J2j£pa{Si) ^i]^]- latter two models can be 
easily brought into the form (]l|) by the standard 'field- 
theoretic' trick of introducing Dirac (5-functions and their 
exponential representations using purely imaginary con- 
jugate variables S — {Si, . . . , S,n)- This leads to an aug- 
mentation of the space of variables to the set (S, S). The 
hatted variables have the complex single variable distri- 
butions p{S) = J ^Q-Sh^-H(h) ^g^Q neural 

network model and p{S) = J ^e-^''/(l + e'*) for the 
belief network (where m ~ N). The augmented coupling 

^ j , where B^j = 



matrix is of the form J = 

and A = for the neural network and A. 
for the belief net. 

We will derive both an adaptive TAP-like ap- 
proximation for the marginal distribution Pi{S) = 
lUj^idSjPiS) and the free energy P(J,6/) = 
— \nZ{3,9). The free energy corresponds to the nega- 
tive log probability of the observed data which can be 
used as a yardstick for deciding which model best fits to 
the data. 

Our derivation will be based on the cavity approach 
introduced by [|j . We will assume that we are not dealing 
with a glassy system with its many ergodic components, 
but that all averages are for a single state. This is (as 
shown for many of the teacher-student scenarios studied 
in the statistical mechanics of neural networks) usually 
expected to hold when the probabilistic model is well 
matched to the data. Defining the field hi = JijSj, 
the marginal distribution of Si can be written as 



P^{S) 



Y[dS,PiS) 



P.iS) 



(2) 



where we have introduced an effective single variable 
Hamiltonian Hi{S) with corresponding partition function 
Zi. Defining an auxiliary average over the distribution of 
the system with variable Si left out by (. . .)^^, we get 



H,iS) = \n{e^''') 



V 



E 

k 



^fc ok 



(3) 



where k^*-* are the cumulants of this cavity distribution. 



{hi)\i and = {h-;)\i - (h,)^^ etc. 



I.e. k\ ' 

The basic physical assumption, which is the major in- 
gredient of all cavity derivations of the TAP mean field 
theory is that all variables Sj have only weak mu- 
tual dependencies. Mathematically expressed within the 
so-called clustering hypothesis H this becomes equiva- 
lent to the vanishing of all cumulants Kj, with k > 2 for 
fully connected systems. In the case, where the Sj are 
real variables with positive measure, this corresponds to 
a central limit theorem for the cavity fields. Under this 



assumption, setting Vi = , we get 
{h.) = ^1 dSp^iS) 

P^{S) 



dS 



z,. 



{h,\, + V,{S,) (4) 
(5) 



for i — I, . . . , N . So far, the approach is well known. 
The new aspect of our paper is in the way we com- 
pute the ^'s. Since these reaction terms account for 
the weak influence between random variables, they can 
be computed self-consistently from the matrix of sus- 
ceptibilities Xij = ^gg'^ ■ We make the approxima- 
tion that upon differentiation, the V^'s are held constant 
which is consistent with the fact that the 1^'s are ex- 
pected to be selfaveraging quantities in the thermody- 
namic limit. Under this assumption we get from eq. (|^) 
Xij = Xii i^i] + Y^kiJik - VkSik)xk]) which can be solved 
with respect to x and yields x = {-^ ^ J)^^ i where 
A — diagjV^i + 1/ Xii} is a. diagonal matrix. The Fluctu- 
ation Dissipation Theorem (again assuming that we deal 
with a single state), shows that x also equals the matrix 
of correlations dj = {SjSk) — {Sj){Sk)- By specializing 
to the diagonal elements, we can compute as a function 
of {Sf) - (S',)2 by solving 



\nZ, 



(A -J) 



(6) 



for i = 1, . . . , N. The sets of equations (|^) together with 
(^ constitute the first main result of this letter. They 
yield closed sets of equations for the first and second mo- 
ments of Si which in turn enables us to approximate the 
full marginal distribution of Si and the correlation func- 
tions. For comparison we note that the naive mean field 
approximation (for real random variables) is obtained by 
setting Vi = 0. Selfinteractions ViSi determined by the 
the linear response method have also been introduced in 
JlOt as a heuristics to correct the naive MF equations for 
Boltzmann machines. A sanity check of the internal con- 
sistency of our approach is obtained by the fact that the 
matrix x niust be positive definite. (If a group of the 
variables are complex, this has to hold for the submatrix 
of the real random variables). 

The next task is to compute the adaptive TAP approx- 
imation to the free energy F{3,9) — — lnZ(J,0). It is 



2 



useful to generalize our model eq. (|^) to a one parameter 
class of models where the interaction J is replaced by sJ 
with < s < 1 and to define the Legendre transform 
(Gibbs free energy) by 



requires the calculation of the asymptotic scaling of 



$,(m,M) =^^(sJ + A,6> + 7) + ^7,' 



where 7^ and Xi are external fields conjugate to Si and Sf 
which must be chosen to extremize the right hand side 
and A is a diagonal matrix with entries . The solutions 
m'^ and M'^ of the sets of equations d„n^s — dui^ — 
0, determine the correct equilibrium expectation values 
{Si)s = ml and {Sf)s = Mf (the index indicates that 
the expectation is taken with parameter s). Our desired 
approximation to the free energy is finally obtained as 
F{3,0) = <i>i(m'', M"). To compute $1 we differentiate 
<i>s with respect to s, to show that 



$1 = $0 



ds 



(7) 



with XsAj = {SiSj)s - {Si)s{Sj)s. Inserting our TAP 
approximation Xs — i-^s — s3)^^ and integrating, we 
obtain 



A$ = i In det(A - J) - ^ + J H 



(8) 



with Xii — Mi — mf. The first two terms constitute 
the naive mean field approximation to $ and the last 
term A$ is the Onsager correction. Note, that this re- 
sult is not equivalent to a truncation of a power series 
expansion of $ to second order in s (a Plefka expansion 
|l8| ) but contains terms of all orders. A different way to 
derive this result is obtained from the observation that 
the functional form of the Onsager term Vi in the TAP 
equations does not depend on the specific single variable 
densities p(S). Hence, we may compute this universal 
form by calculating <i> for an exactly solvable model, i.e. 
for a Gaussian p and subtract the naive mean field part. 
This is related to the strategy used by Parisi and Potters 
1^ in order to derive the TAP equations for a spin glass 
model with orthogonal random matrix J. 

To check the significance of our approach, we will next 
show that it will give the correct results for the statis- 
tical mechanics in the thermodynamic limit N —^ 00 
for a large class of distributions of the random matrix 
J. For simplicity, we specialize to models with only 
one type of single variable distribution pi{S) = p{S). 
Selfaveraging properties of the models can be computed 
within the replica framework by averaging the free en- 
ergy over the distribution of the random matrix J. This 



i Tr(AJ) 



for the matrix 



the function Kn{A) = ^ in 

Aij — J22=i SiaSja, where the So, are n replicas of the 
variables. Following Ref. and assuming the scaling 
Kn{A) ~ TrG(A/iV) as N ^ 00 where the function G 
characterizes the random matrix ensemble, the averaged 
free energy will depend only on the single set of orderpa- 
rameters given by qab = j/J^i i^iaSib- This is character- 
istic for models with matrices J of extensive connectivity. 
E.g., the SK-model with coupling matrix of independent 
components of variance has G{r) — and the Hop- 
field model with J, 

/3 



fj.= l -I ~j 

with variance leads to G{r) = — ^(ln(l — fir) + fir). 
Under the assumption of replica symmetry, the averaged 
free energy / = — -^pnZjj is obtained by extremizing 



x'^x^; and independent x^ 



fiq, A) = -G(A) + A (gG"(A) + G'(A)) (9) 
Dzln [ dSp{S) exp \y^2qG"iA)zS + G'{A)S^^ 



with respect to the off-diagonal orderparameter q = qab 



dz 
/2tv 



V2 



and to A = qaa — q, where Dz 

We can show the correspondence for N ^ 00 oi the 
adaptive TAP method and replica theory. A disor- 
der average gives the conventional TAP result for the 
Onsager coefficients: Vi — V — 2G'(x) , with x — 
-^VJxjiJj. To compare the TAP Gibbs free energy 
eq. (^) with the replica symmetric free energy (|^), we 
compute / = — limjv^-y^oo 7^ [in / dm dMexp(— 7<i>)] ^ , 
where the paths of integration must be chosen such 
that the integral converges. The integral will be dom- 
inated by the values for m and M which fulfill the TAP 
equations. Evaluating this expression using the replica 
method shows that both free energies coincide, i.e. / = /. 
It is also possible to translate the condition of positive 
definiteness of the susceptibility matrix x into the ther- 
modynamic limit. We can show that this stability is sat- 
isfied for 1 — 2G"(x)-^ X]JXij]j > 0, which coincides with 
the well known AT stability condition of replica theory 

i- 

We have performed two types of simulations of the 
TAP approaches on Bayesian neural network learning 
problems. For the first case (Fig. [T]) we test the self- 
consistency of our method on a real data set, 'Sonar - 
Mines versus Rocks' |^ of size m = 104 with with binary 
class labels yj = ±1 and a TV = 60 dimensional input 
space. The prior is ^o(S) oc exp(— S • S/2) and the like- 
lihood F{hj) = 4){yihj/a), with hj = ^^x^jSi, (j){t) = 



Dz and a'^ 



— 0.5. We compute the prediction for the 
average (conjugate) cavity field = [hj) — Vj{Sj), 

using eq. (^. The fraction of negative terms yj(hj)\j 
equals the 'leave-one-out' estimate eioo which provides 
an important practical estimator for the generalization 
error of the network. If our theory takes the reaction of 
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FIG. 1. Test of self-consistency of TAP - yj{hj)\^j versus 
yj{hj)\^^'^^ . The stars/circles are for adaptive/conventional 
TAP. The inset shows the distribution of Vi with the thick 
hue indicating the conventional TAP solution. 



the remaining variables correctly into account, this pre- 
diction should be close to the 'exact' average cavity field 
obtained by leaving one example out and solving the TAP 
eqs. for the remaining to — 1 examples. Fig. ^ shows ex- 
cellent agreement between the two computations and we 
find eioo = ^loo^^ = 33/104. For comparison, the conven- 
tional TAP approach 1^ , which assumes a distribution of 
input data vectors with independent components, leads 



to a wrong result, eioo = 41/104 and ef^^'^* = 33/104. 

In the second set of simulations Fig. g we demonstrate 
that the adaptive TAP method yields the correct statisti- 
cal physics for the case of the linear Ising perceptron . 
This has prior distribution P{S) = ^S{S-1) + ^6{S+1) 
and likelihood F{hj) oc exp(— (y,; — hj)'^/2a'^), where we 
have chosen = 0.2 and = 60 in the simulations. See 
Ref. 1^ for a discussion of this model in the context of 
demodulation in communications systems. To compare 
with the replica results , we have generated inputs at 
random and outputs using a noise free teacher percep- 
tron sampled from the same prior. The small deviations 
between theory and TAP simulations close to the first 
order transition are attributed to hysteresis effects. 

It will be interesting to extend our adaptive TAP 
method to glassy systems (generalizing the ideas of chap- 
ter V in [^) where the present approach would fail, 
e.g. indicated by the appearance of negative eigenvalues 
in the susceptibility matrix. However, one may specu- 
late that in such cases solving the TAP equations may 
be highly nontrivial. Acknowledgments. We thank T. 
Tanaka for providing us with his preprint ||2l| . This 
research is supported by the Swedish Foundation for 
Strategic Research as well as the Danish Research Coun- 
cils through the Computational Neural Network Center 
(CONNECT) and the THOR Center for Neuroinformat- 




FIG. 2. Learning curve for the linear Ising perceptron - 
test error rate against number of training examples. The 
stars/circles are for adaptive/conventional TAP. The dashed 
lines are the replica result where the first vertical is the 
thermodynamic transition and the second the spinodal point 
where the meta-stable solution vanishes. The inset shows the 
corresponding normalized free energy ^/N. The simulations 
are averaged over 100 runs with error bars of the size of the 
symbols. 



ICS. 



[1] 
[2] 
[3] 

[4; 

[5 

[6: 

[9: 
[lo: 

[11 

[12 

[13 
[14 



M. Jordan, ed., Learning in Graphical Models, MIT Press 

(1999) . 

M. Opper and D. Saad, eds. Advanced Mean Field Meth- 
ods, Theory and Practice, MIT Press (2001). 
M. Opper and O. Winther, Neural Computation 12, 2655 

(2000) . 

M. Mezard and G. Parisi, Europhys. Lett. 2, 913 (1986). 
M. Mezard, G. Parisi and M. A. Virasoro, Spin Glass 
Theory and Beyond, Lecture Notes in Physics, 9, World 
Scientific (1987). 

M. Mezard, J. Phys. A (Math. Gen. 22, 2181 (1989). 

K. Y. M. Wong, Europhys. Lett. 30, 245 (1995). 

M. Opper and O. Winther, Phys. Rev. Lett. 76, 1964 

(1996). 

Y. Kabashima and D. Saad, Euro. Phys. Lett. 44, 668 
(1998). 

H. J. Kappen and F. B. Rodriguez, in Advances in Neural 
Information Processing Systems 11 eds. M. S. Kearns, S. 
A. SoUa and D. A. Cohn, 280-286 MIT Press (1999). 
T. Tanaka, Phys. Rev. E 58, 2302-2310 (1998). 

C. Bhattacharyya and S. S. Keerthi, J. Phys. A (Math. 
Gen.) 10, 1307 (2000). 

D. J. Thouless, P. W. Anderson and R. G. Palmer, Phil. 
Mag. 35, 593 (1977). 

D. Sherrington, and K. Kirkpatrick, Phys. Rev. Lett. 35, 
1792 (1975). 



4 



[15] J. J. Hopfield, Proc. Nat. Acad. Sci. USA, 79 2554 (1982). 
[16] P.A.d.F.R. H0jen-S0rensen, O. Winthor, and L. K. 

Hansen, to appear in (NIPS'2000), MIT Press (2001). 
[17] R. Neal, Artificial Intelligence 56, 71-113 (1992). 
[18] T. Plefka, J. Phys. A 15, 1971 (1982). 
[19] G. Parisi and M. Potters, J. Phys. A (Math. Gen.) 28, 

5267 (1995). 

[20] R. P. Gorman and T. J. Sejnowski, Neural Networks 1, 
75 (1988). 

[21] T. Tanaka, to appear in (NIPS'2000), MIT Press (2001). 
[22] H. S. Seung, H. Sompohnsky and N. Tishby, Phys. Rev. 
A. 45, 6056 (1992). 



5 



